Datastore: implement barrier if we see "in failed tx" error. #1418

Merged

branlwyd merged 1 commit into main from bran/op-group on May 27, 2023

Conversation

branlwyd (Contributor)

For some error modes, Postgres returns an error to the caller and then fails all subsequent statements within the same transaction with an "in failed SQL transaction" error. In effect, one statement receives the "root cause" error and every later statement receives an "in failed SQL transaction" error. In a pipelined scenario, if our code processes the results of these statements concurrently -- e.g. because they are part of a `try_join!`/`try_join_all` group -- we might receive and handle one of the "in failed SQL transaction" errors before we handle the "root cause" error, which can cause the "root cause" error's future to be cancelled before we ever evaluate it. If the "root cause" error would have triggered a retry, we end up skipping a DB-based retry when one was warranted.
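
For illustration, here is a minimal, self-contained sketch of the race (not Janus code; the error enum, function names, and timings are contrived to make the cancellation deterministic). The follow-on statement's error resolves first, so `try_join!` returns it and drops the future that would have yielded the retryable root cause:

```rust
// Requires: tokio = { version = "1", features = ["full"] }
use std::time::Duration;

#[derive(Debug)]
enum DbError {
    SerializationFailure,   // hypothetical "root cause" error that warrants a retry
    InFailedSqlTransaction, // follow-on error from the poisoned transaction
}

// Simulated statement whose (retryable) error takes longer to surface.
async fn root_cause_stmt() -> Result<(), DbError> {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Err(DbError::SerializationFailure)
}

// Simulated later statement that fails fast with the generic error.
async fn follow_on_stmt() -> Result<(), DbError> {
    tokio::time::sleep(Duration::from_millis(10)).await;
    Err(DbError::InFailedSqlTransaction)
}

#[tokio::main]
async fn main() {
    // try_join! returns on the first error and drops the remaining futures,
    // so we observe InFailedSqlTransaction and never see SerializationFailure.
    let result = tokio::try_join!(root_cause_stmt(), follow_on_stmt());
    println!("{result:?}"); // Err(InFailedSqlTransaction)
}
```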

To fix this problem, we (internally) wrap all direct DB operations in `run_op`. This function groups concurrent database operations into "operation groups", which allow us to wait for all operations in the group to complete (this waiting is called "draining"). If we ever observe an "in failed SQL transaction" error, we drain the operation group before returning. Under the assumption that the "root cause" error is concurrent with the "in failed SQL transaction" errors, this guarantees that we evaluate the "root cause" error for retryability before any error escapes the transaction code.
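
As a rough illustration of the approach (not the actual `run_op` implementation; `DbError`, `OpGroup`, and the retryable-error classification below are assumptions for the sketch), an operation group can be modeled as an in-flight counter plus a flag recording whether any concurrent operation hit a retryable error:

```rust
// Requires: tokio = { version = "1", features = ["full"] }
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::watch;

// Hypothetical error classification, mirroring the sketch above.
#[derive(Debug)]
enum DbError {
    SerializationFailure,   // retryable "root cause" error
    InFailedSqlTransaction, // follow-on error from the poisoned transaction
}

struct OpGroup {
    in_flight: watch::Sender<usize>, // count of operations currently running
    saw_retryable_error: AtomicBool, // set if any operation hit a retryable error
}

impl OpGroup {
    fn new() -> Self {
        Self {
            in_flight: watch::channel(0).0,
            saw_retryable_error: AtomicBool::new(false),
        }
    }

    // Hypothetical run_op wrapper: runs one DB operation as part of the group.
    async fn run_op<T>(
        &self,
        op: impl std::future::Future<Output = Result<T, DbError>>,
    ) -> Result<T, DbError> {
        // Register this operation with the group, run it, then deregister,
        // recording whether it produced a retryable "root cause" error.
        self.in_flight.send_modify(|n| *n += 1);
        let result = op.await;
        if matches!(&result, Err(DbError::SerializationFailure)) {
            self.saw_retryable_error.store(true, Ordering::SeqCst);
        }
        self.in_flight.send_modify(|n| *n -= 1);

        if matches!(&result, Err(DbError::InFailedSqlTransaction)) {
            // Drain: wait until every other operation in the group has
            // completed, so any concurrent root-cause error has been recorded.
            let mut rx = self.in_flight.subscribe();
            let _ = rx.wait_for(|&n| n == 0).await;
            if self.saw_retryable_error.load(Ordering::SeqCst) {
                // Surface the retryable root cause rather than the follow-on
                // "in failed SQL transaction" error.
                return Err(DbError::SerializationFailure);
            }
        }
        result
    }
}
```

With this shape, each of the futures passed to `try_join!` would go through `run_op` on a shared `OpGroup`, so whichever error reaches the caller first has already waited for (and absorbed) any concurrent retryable error.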

Closes #1417.

branlwyd requested a review from a team as a code owner on May 26, 2023 at 22:04

divergentdave (Collaborator) left a comment:

Very nice!

branlwyd (Contributor, author) commented:

(I'm running my test_until_failure.sh script against the janus_daphne test -- it took a few hours to observe a failure on my workstation last time, so I'm going to leave it running for a while before merging.)

inahga (Contributor) left a comment:

Appreciate the detailed explanations!

branlwyd merged commit 7ad8faa into main on May 27, 2023
branlwyd deleted the bran/op-group branch on May 27, 2023 at 00:33

Successfully merging this pull request may close these issues:

Future cancellation impacts decision to retry transactions (#1417)